Understanding the Relationship between Pollution and Life Expectancy in the US

A Spatial Analysis

Author

Sofija Kaluderovic

1 Intro

PM2.5 is a type of air pollution that is of greatest concern to public health as its microscopic size allows it to enter the bloodstream. The PM2.5 particle can cause different diseases, such as heart and lung diseases, stroke, and cancer. The leading causes of death in the United States are heart disease and cancer, so it is important to understand how much our health is affected by air pollution. This memo aims to explore the relationship between the PM2.5 level in the air and life expectancy across the US in 2020 and 2019.

2 Data

Life Expectancy by State

The dataset contains the life expectancy at birth of people in all the states in the US (male, female, and total).

Life Expectancy by County

The dataset contains life expectancy in all the US counties from 2015 to 2019 (the website also has other datasets that contain life expectancy data from 2000 until 2015). The datasets contain life expectancy based on gender, race, and age. This memo uses a dataset from 2019 with the life expectancy of all races (“Total”), both genders (“Both”), and at birth (“<1 year”).

Air Pollution Datasets

This website has air pollution information from different monitoring stations in the US for different years. The memo uses datasets from 2019 and 2020.

Counties Coordinates

The dataset contains the coordinates of all the US counties as well as their populations.

3 Objectives

The main objective of this memo is to determine whether there is a correlation between PM2.5 pollution and life expectancy, that is, whether people live shorter in the most polluted areas of the US compared to the less polluted ones. This project could be interesting for:

• Ordinary people who are considering where to live in the US,

• Environmental policy makers, who might consider applying stricter policies in certain US states to prevent pollution,

• Big corporations, which are responsible for a large part of air pollution.

4 Weaknesses of the Analysis

• The analysis takes air pollution as the only independent variable, while there are other factors affecting life expectancy.

• The analysis focuses only on PM2.5 pollution particles while disregarding others (e.g., ozone).

• Only one point in time is considered, while air pollution levels change. The current levels of air pollution might have a greater effect on future generations.

• The analysis has a narrow geographic focus. It considers only the US.

5 State Level Analysis

5.1 US State Level Data, 2020 Mapped

From the map, California and Oregon stand out as the most polluted states in the US (13.61 and 13.62, respectively) in 2020. Wyoming, North Dakota, South Dakota, and Maine are among the least polluted US states in 2020.

The average life expectancy in US states ranges from 71.9 (Mississippi) to 79.2 (Washington) years in 2020.

5.2 Trend between Life Expectancy and Air Pollution by State, 2020

The trend shows a slight negative correlation between air pollution and life expectancy at birth. California and Oregon are influential observations, as they differ from other states in terms of these factors, so they change the slope of the trend. One possible reason why these two states and Washington have the highest levels of average air pollution but also the highest average life expectancy at birth could be these states’ location. As all three of them are located on the US western coast, their proximity to the Pacific Ocean could play a role in regulating air quality and climate conditions, contributing to a decreased effect of air pollution on people.

The graph shows a stronger trend between air pollution and life expectancy at birth after excluding California, Oregon, and Washington.

5.3 Most Polluted Area Approximation

After selecting the most polluted US states, I approximated the most polluted areas in the country and created buffers around them.

5.4 Area with the Lowest Life Expectancy Approximation

After selecting the states with the lowest life expectancy, I approximated the area in the US with the lowest average life expectancy at birth and created a buffer around it.

The two maps show that the area in the south of the US with the highest levels of average pollution is also the area with the lowest average life expectancy at birth. Again, California, Oregon, and Washington are an exception in this case. In order to better understand the correlation between air pollution and life expectancy at birth, this memo will now focus on county-level data.

6 County Level Analysis

6.1 Trend between Life Expectancy and Air Pollution by County, 2019

The negative slope of the trend indicates a negative relationship between life expectancy at birth and air pollution. Two out of the three most polluted counties (Kings County and Tulare County) are in California, and one is in Washington. The trend also shows that the two most polluted counties in California have a much lower average life expectancy at birth than other counties in California (which are at the top in terms of life expectancy).

6.2 Life Expectancy and Distance from the Most Polluted Counties in the US, 2019

This memo will now show the relationship between life expectancy at birth and distance from some of the most polluted counties in the US. First, it will determine all the counties within 200 km of the most polluted counties. Then it will look at how life expectancy at birth changes as the counties are further away from the most polluted county.

Stevens County, Washington

The graph shows that, as counties are further away from Stevens County, there is a slight increase in their average life expectancy at birth. However, Benewah County and Shoshone County (both in Idaho) decrease the slope of the trend as they both have a much lower average life expectancy compared to other counties within 200 km of Stevens County. Both of them have air pollution measurement stations that show relatively high levels of pollution (10.114920 and 9.638841). Therefore, I decided to exclude the two counties from the trend.

The new trend shows a much stronger positive relationship between counties’ life expectancy at birth and their distance from Stevens County.

Tulare County, California

Again, the trend shows a positive relationship between the average life expectancy and distance from the most polluted county.

Kings County, California

Gwinnett County, Georgia

The trend shows a negative relationship between life expectancy and distance from Gwinnett County, Georgia. This means that as counties are further away from Gwinnett County, their life expectancy decreases.

Webb County, Texas

The trend shows a slight negative relationship.

Butler County, Ohio

The trend shows a slight positive relationship.

Winnebago County, Illinois

The trend shows a negative relationship.

Plumas County, California

The trend shows a slight positive relationship.

The three most polluted counties in the US have much higher air pollution levels than other counties (13.09, 12.89, and 12.17). Other observed counties (Gwinnett, Webb, Butler, Winnebago, and Plumas), while still registering elevated average air pollution levels, fall within a narrower range (around 10), which is closer to the pollution levels of other counties. There are also other factors that affect life expectancy at birth, so the correlation between life expectancy and air pollution becomes more apparent after looking at the extreme cases (Stevens County, Tulare County, and Kings County). This could explain the clear positive relationship observed between life expectancy at birth and the distance from the extremely polluted counties, in contrast to other counties. This hypothesis is further explored in this memo by looking at the variation of the counties’ average pollution levels in California, Washington, Ohio, and Georgia.

6.3 Average Pollution by County in Different States

California

Average air pollution in Californian counties ranges from 3.14 (Lake County) to 12.89 (Tulare County). There is a lot of variation in average air pollution among different counties in California.

Washington

Average air pollution in Washington counties ranges from 4.54 (Whatcom County) to 13.09 (Stevens County).

Ohio

Average air pollution in Ohio counties ranges from 6.39 (Whatcom County) to 10.31 (Butler County). All observations on the graph are close together, which indicates less variation in air pollution levels across the state.

Georgia

Average air pollution in Georgian counties ranges from 7.18 (Coffee County) to 10.8 (Gwinnett County). The graph suggests little variation in the counties’ average air pollution levels in the state.

All four graphs have the same range on the y-axis to make the difference in variation in counties’ average air pollution between states more apparent. As the graphs suggest, there is less variation in air pollution across Ohio and Georgia. Therefore, even the most polluted counties in these states are relatively close in terms of pollution to the surrounding counties. Counties in California, on the other hand, have a greater variation in air pollution levels. Additionally, Stevens County in Washington proved to be an extreme observation, as it has a much higher air pollution level than other counties in Washington.

7 Conclusion

Spatial analysis can help understand the effect that different types of environmental pollution (in this case, air pollution) have on life expectancy at birth. This analysis has explored the correlation between life expectancy at birth and air pollution in different parts of the US. The conclusion drawn from the analysis suggests a positive relationship between life expectancy and distance from extremely polluted counties, while this relationship varies in less extreme cases. This memo provides insights into the complex interplay between air pollution and life expectancy that should be further researched in order to improve public health in different countries. The analysis can be further expanded with other variables that affect life expectancy, and it can also be applied to other parts of the world.

8 Appendix

#Reading the libraries
library("sf")
library ("ggplot2")
library("ggthemes")
library("tibble")
library("dplyr")
library("stars")
library("tidyr")

#Reading the US shape file
map_US <- read_sf("./gadm41_USA_1.shp")

#Simplifying it
map_US<-st_simplify(map_US, dTolerance=0.1)

#Defining the coordinates
min_lon_x_us<-(-130)
max_lon_x_us<-(-57)
min_lat_y_us<-(26)
max_lat_y_us<-(53)

#Reading the US pollution in 2020 file
pollution_PM2.5_US <- read.csv(file="./annual_conc_by_monitor_2020.csv")

#Selecting only PM2.5 from the pollution particles as it is of greates concern to public health
pollution_PM2.5_US <- subset(pollution_PM2.5_US, Parameter.Name == "PM2.5 - Local Conditions" & Pollutant.Standard == "PM25 24-hour 2012")

#Turning the CSV file into an SF object
pollution_PM2.5_US <-st_as_sf(pollution_PM2.5_US, coords = c("Longitude", "Latitude"))
#Associating the crs information for the US shapefile to points
pollution_PM2.5_US <- st_set_crs(pollution_PM2.5_US, st_crs(map_US))

#Calculating the average pollution per state by computing the average of observations from all monitoring stations within each state
pollution_in_US <- pollution_PM2.5_US %>%
  group_by(State.Name) %>%
  summarize(average_pollution = mean(Arithmetic.Mean, na.rm = TRUE))

#Performing the spatial join on the dataset containing the pollution information per state and the US shape file
pollution_in_US <- st_join(map_US, pollution_in_US)

#Reading the file containing the US states' life expectancy information from 2020
life_expUS <- read.csv(file= "./life_exp_US2020.csv")

#Joining the life expectancy data to the pollution data
pollution_life_exp <- left_join(pollution_in_US, life_expUS, by = c("State.Name"="State"))
#Removing manually Alaska, Hawaii, and District of Columbia
pollution_life_exp <- pollution_life_exp[-c(2, 9, 12), ]

#Mapping the average pollution level per state in 2020
ggplot() +
  geom_sf(data = pollution_life_exp, aes(fill = average_pollution)) +
  scale_fill_viridis_c(option = "turbo") +
  labs(title= "US Average Pollution per State, 2020",
    fill = "Average\nPollution")+
  coord_sf(xlim = c(min_lon_x_us, max_lon_x_us), 
           ylim = c(min_lat_y_us, max_lat_y_us))
#From the map, we can see that California and Oregon are the most polluted states in the US, while Wyoming, North Dakota, South Dakota, and Maine seem to be the least polluted states in 2020.

#Mapping the average life expectancy per state in 2020
ggplot() +
  geom_sf(data = pollution_life_exp, aes(fill = LE)) +
  scale_fill_viridis_c(option = "turbo") +
  labs(title= "US Average Life Expectancy per State, 2020",
    fill = "Average\nLife Expectancy")+
  coord_sf(xlim = c(min_lon_x_us, max_lon_x_us), 
           ylim = c(min_lat_y_us, max_lat_y_us))
# The highest life expectancy in 2020 in the US is mostly on the Western coast and in Minnesota. The lowest life expectancy, on the other hand, seems to be in Southeast of the US.

#Plotting life expectancy (on the y-axis) against the pollution level (on the x-axis) by state 
ggplot(pollution_life_exp, aes(x = average_pollution, y = LE)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  
  labs(x = "Pollution Level", y = "Life Expectancy", title = "Trend between Life Expectancy and Pollution by State")+
    geom_text(aes(label = State.Name), hjust = 0.8, vjust = -1,
              check_overlap = TRUE, 
            size=2.2)+
  theme_economist()
#California  and Oregon are influential observations as they differ from other states in terms of life expectancy and pollution, so they change the slope of the trend.

#Removing the influential observations and Washington (it is an outlier)
pollution_life_exp_no_inf_obs <- pollution_life_exp[-c(4, 35, 45), ]

#Plotting the trend again, now without California, Oregon, and Washington
ggplot(pollution_life_exp_no_inf_obs, aes(x = average_pollution, y = LE)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  
  labs(x = "Pollution Level", y = "Life Expectancy", title = "Trend between Life Expectancy and Pollution by State")+
  geom_text(aes(label = State.Name), hjust = 0.8, vjust = -1,
              check_overlap = TRUE, 
            size=2.2)+
  theme_economist()

#Sorting data based on pollution in a decreasing order
pollution_rank <- pollution_life_exp[order(pollution_life_exp$average_pollution, decreasing = TRUE), ]

#Creating most_pol which will include the 14 most polluted states
most_pol <- head(pollution_rank, n=14)

#Creating state centroids
states_ctr <- st_centroid(pollution_life_exp)

#Selecting only the centroids of the most polluted states
most_pol_centroids <- states_ctr[states_ctr$State.Name %in% most_pol$State.Name, ]

#Selecting only the states on the western coast from the most polluted states
west_coast_states <- c("California", "Oregon", "Washington", "Arizona", "Idaho")
west_coast_centroids <- most_pol_centroids[most_pol_centroids$State.Name %in% west_coast_states, ]

#Calculating the centroid of the states on the western coast, which will indicate the area
west_coast_centroids <- st_union(west_coast_centroids)
west_center <- st_centroid(west_coast_centroids)

#Creating a buffer around the centroid of the western area
west_center_buffer <- st_buffer(west_center, dist = 700000)

#Selecting only the states in the south from the most polluted states
south_states <- c("Mississippi", "Texas", "Illinois", "Indiana", "Missouri", "Georgia", "Oklahoma", "Alabama", "Ohio")
south_centroids <- most_pol_centroids[most_pol_centroids$State.Name %in% south_states, ]

#Calculating the centroid of the states in the south, which will indicate the area
south_centroids <- st_union(south_centroids)
south_center <- st_centroid(south_centroids)

#Creating a buffer around the centroid of the southern area
south_center_buffer <- st_buffer(south_center, dist = 700000)

#Defining the US map that will not include the borders between states (for visualization)
us_map <- st_union(pollution_life_exp)

#Choosing the color blind friendly colors from the library viridis
library(viridis)
colors <- c(viridis(2))

#Plotting buffers around the most polluted areas in the US
ggplot() +
  geom_sf(data = us_map) +
  geom_sf(data = west_center_buffer, fill = colors[2]) +
  geom_sf(data = south_center_buffer, fill = colors[2]) +
  labs(title = "Areas with the Highest Pollution in the US in 2020")

#Sorting data based on life expectancy in a decreasing order
life_exp_rank <- pollution_life_exp[order(pollution_life_exp$LE, decreasing = TRUE), ]

#Creating lowest_life_exp that will include 14 states with the lowest life expectancy
lowest_life_exp <- tail(life_exp_rank, n=14)

#Selecting only the centroids of the states with the lowest life expectancy
lowest_life_exp_centroids <- states_ctr[states_ctr$State.Name %in% lowest_life_exp$State.Name, ]

#Selecting only the states in the south from the most polluted states
southern_states <- c("South Carolina", "New Mexico", "Arkansas", "Indiana", "Missouri", "Georgia", "Oklahoma", "Alabama", "Ohio", "Tennessee", "Louisiana", "Kentucky", "West Virginia", "Mississippi")
southern_centroids <- lowest_life_exp_centroids[lowest_life_exp_centroids$State.Name %in% southern_states, ]

#Calculating the centroid of the states in the south, which will indicate the area
southern_centroids <- st_union(southern_centroids)
southern_center <- st_centroid(southern_centroids)

#Creating a buffer around the centroid of the southern area
southern_center_buffer <- st_buffer(southern_center, dist = 700000)

#Plotting buffer around the area with the lowest life expectancy in the US
ggplot() +
  geom_sf(data = us_map) +
  geom_sf(data = southern_center_buffer, fill = colors[1]) +
  labs(title = "Area with the Lowest Life Expectancy in the US in 2020")

#County Level

#Reading the US shape file (with counties)
map_US2 <- read_sf("./gadm41_USA_2.shp")

#Simplifying it
map_US2<-st_simplify(map_US2, dTolerance=0.1)

#Reading the US average pollution from 2019 file
county_pollution_US <- read.csv("./pollution/annual_conc_by_monitor_2019.csv")
#Selecting only PM2.5 from the pollution particles as it is of greates concern to public health
county_pollution_PM2.5 <- subset(county_pollution_US, Parameter.Name == "PM2.5 - Local Conditions" & Pollutant.Standard == "PM25 24-hour 2012")

#Turning the CSV file into an SF object
county_pollution_PM2.5 <-st_as_sf(county_pollution_PM2.5, coords = c("Longitude", "Latitude"))
#Associating the crs information for the US shapefile to points
county_pollution_PM2.5 <- st_set_crs(county_pollution_PM2.5, st_crs(map_US2))

#Creating a column that will containg both the name of the county and the name of the state
county_pollution_PM2.5$Location <- paste(county_pollution_PM2.5$County.Name, county_pollution_PM2.5$State.Name, sep = ", ")

#Calculating the average pollution per county by computing the average of observations from all monitoring stations within each county
county_pol <- county_pollution_PM2.5 %>%
  group_by(Location) %>%
  summarize(average_pollution = mean(Arithmetic.Mean, na.rm = TRUE))

#Performing spatial join on the dataset containing the county pollution and the US map
county_pol <- st_join(map_US2, county_pol)

#Reading the life expectancy in 2019 file 
county_life_expUS <- read.csv(file= "./life_exp/Life_Expectancy2019.csv")

#Joining life expectancy data to the pollution data based on the columns "Location"
county_pollution_life_exp <- left_join(county_pol, county_life_expUS, by = c("Location"="Location"))
#Removing the counties that don't have life expectancy data
county_pollution_life_exp <- na.omit(county_pollution_life_exp)

#Reading the file that contains counties' coordinates (since there is more data about life expectancy then about pollution, but the file containing life expectancy does not have the counties' coordinates)
counties_coordinates <- read.csv(file="./counties_coordinates/uscounties.csv")

#Creating a new column that will contain both the name of the county and the name of the state per each observation
counties_coordinates$Location <- paste(counties_coordinates$county, counties_coordinates$state_name, sep = ", ")

# Removing duplicates from counties_coordinates
counties_coordinates <- counties_coordinates[!duplicated(counties_coordinates$Location), ]

# Removing duplicates from county_life_expUS
county_life_expUS <- county_life_expUS[!duplicated(county_life_expUS$Location), ]

#Joining counties' coordinates to the life expectancy file
county_life_expUS <- left_join(county_life_expUS, counties_coordinates, by = c("Location"="Location"))
county_life_expUS <- na.omit(county_life_expUS)

#Turning the CSV file into an SF object
county_life_expUS <-st_as_sf(county_life_expUS, coords = c("lng", "lat"))
#Associating the crs information for the US shapefile to points
county_life_expUS <- st_set_crs(county_life_expUS, st_crs(map_US2))

#Performing spatial join on life expectancy and pollution datasets
county_life_exp_pollution <- st_join(county_life_expUS, county_pol)
#The new dataset (county_life_exp_pollution) contains 2949 observations while the old one (county_pollution_life_exp) contains only 596 observations

#Plotting life expectancy (on the y-axis) against the pollution level (on the x-axis by county) 
ggplot(county_pollution_life_exp, aes(x = average_pollution, y = Life.Expectancy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) + 
  labs(x = "Pollution Level", y = "Life Expectancy", title = "Trend between Life Expectancy and Pollution by County")+
  geom_text(aes(label = Location), hjust = 0.45, vjust = -1,
              check_overlap = TRUE, 
            size=2.2)+
  theme_economist()

#Creating county centroids
county_ctr <- st_centroid(county_life_exp_pollution)

#Sorting data based on counties' average pollution in a decreasing order
county_pollution_rank <- county_pollution_life_exp[order(county_pollution_life_exp$average_pollution, decreasing = TRUE), ]

#Defining the most polluted county
most_polluted_county <- head(county_pollution_rank, n =1)

#Selecting the most polluted county's centroid
most_pol_county_ctr <- county_ctr[county_ctr$Location.x == most_polluted_county$Location, ]

library(units)
#Calculating distance of all other counties from the most polluted county (Stevens County, Washington)
county_life_exp_pollution$distance_stevens = st_distance(county_ctr, most_pol_county_ctr)
# Converting the distances to km
county_life_exp_pollution$distance_stevens_km <- set_units(county_life_exp_pollution$distance_stevens, 'km')

#Turning the distances into numeric values
county_life_exp_pollution$distance_stevens_km <- as.numeric(county_life_exp_pollution$distance_stevens_km)

#Sorting counties based on their distance from the most polluted county
county_distance_stevens_rank <- county_life_exp_pollution[order(county_life_exp_pollution$distance_stevens_km, decreasing = TRUE), ]

#Selecting the closest counties within 200km from Stevens County
closest_counties_stevens <- tail(county_distance_stevens_rank, n =17)

#Plotting life expectancy (on the y-axis) against the counties' distance from the Stevens County, Washington
ggplot(closest_counties_stevens, aes(x = distance_stevens_km, y = Life.Expectancy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  
  labs(x = "Distance from Stevens County, Washington", y = "Life Expectancy", title = "Life Expectancy versus Distance from Stevens County")+
  geom_text(aes(label = Location.x), hjust = 0.35, vjust = -1, 
            check_overlap = TRUE, 
            size=2.2)+
  theme_economist()
#Shoshone County, Idaho and and Benewah County, Idaho both have the pollution measurement stations and have the relatively high levels of pollution (9.638841 and 10.114920 respectively), so I decided to look at a trend without these two counties.

#Removing observations corresponding to "Shoshone" and "Benewah"
closest_counties_stevens <- subset(closest_counties_stevens, NAME_2 != "Shoshone" & NAME_2 != "Benewah")

#Plotting the trend againg without the two states
ggplot(closest_counties_stevens, aes(x = distance_stevens_km, y = Life.Expectancy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  
  labs(x = "Distance from Stevens County, Washington", y = "Life Expectancy", title = "Life Expectancy versus Distance from Stevens County")+
  geom_text(aes(label = Location.x), hjust = 0.35, vjust = -1, 
            check_overlap = TRUE, 
            size=2.2)+
  theme_economist()

#Assigning Tulare County, the second most polluted county, to most_polluted_county
most_polluted_county <- county_pollution_rank[county_pollution_rank$Location == "Tulare, California", ]

#Selecting its centroid
most_pol_county_ctr <- county_ctr[county_ctr$Location.x == most_polluted_county$Location, ]

#Calculating other counties' distance from Tulare County
county_life_exp_pollution$distance_tulare = st_distance(county_ctr, most_pol_county_ctr)
#Converting to km
county_life_exp_pollution$distance_tulare_km <- set_units(county_life_exp_pollution$distance_tulare, 'km')

#Turning distances into numeric values
county_life_exp_pollution$distance_tulare_km <- as.numeric(county_life_exp_pollution$distance_tulare_km)

#Sorting other counties based on their distance from Tulare County
county_distance_tulare_rank <- county_life_exp_pollution[order(county_life_exp_pollution$distance_tulare_km, decreasing = TRUE), ]

#Selecting the closest counties within 200km from Tulare County
closest_counties_tulare <- tail(county_distance_tulare_rank, n =10)

#Plotting life expectancy (on the y-axis) against the counties' distance from the Tulare County, California
ggplot(closest_counties_tulare, aes(x = distance_tulare_km, y = Life.Expectancy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) + 
  labs(x = "Distance from Tulare County, California", y = "Life Expectancy", title = "Life Expectancy versus Distance from Tulare County")+
  geom_text(aes(label = county), hjust = 0.35, vjust = -1, 
            check_overlap = TRUE, 
            size=2.2)+
  theme_economist()

most_polluted_county <- county_pollution_rank[county_pollution_rank$Location == "Kings, California", ]

most_pol_county_ctr <- county_ctr[county_ctr$Location.x == most_polluted_county$Location, ]

county_life_exp_pollution$distance_kings = st_distance(county_ctr, most_pol_county_ctr)

county_life_exp_pollution$distance_kings_km <- set_units(county_life_exp_pollution$distance_kings, 'km')

county_life_exp_pollution$distance_kings_km <- as.numeric(county_life_exp_pollution$distance_kings_km)

county_distance_kings_rank <- county_life_exp_pollution[order(county_life_exp_pollution$distance_kings_km, decreasing = TRUE), ]

#Selecting the closest counties within 200km from Kings County

#Plotting life expectancy (on the y-axis) against the counties' distance from the Kings County, California
ggplot(closest_counties_kings, aes(x = distance_kings_km, y = Life.Expectancy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  
  labs(x = "Distance from Kings County, California", y = "Life Expectancy", title = "Life Expectancy versus Distance from Kings County")+
  geom_text(aes(label = county), hjust = 0.52, vjust = -1, 
            check_overlap = TRUE, 
            size=2.2)+
  theme_economist()

most_polluted_county <- county_pollution_rank[county_pollution_rank$Location == "Gwinnett, Georgia", ]

most_pol_county_ctr <- county_ctr[county_ctr$Location.x == most_polluted_county$Location, ]

county_life_exp_pollution$distance_gwinnett = st_distance(county_ctr, most_pol_county_ctr)

county_life_exp_pollution$distance_gwinnett_km <- set_units(county_life_exp_pollution$distance_gwinnett, 'km')

county_life_exp_pollution$distance_gwinnett_km <- as.numeric(county_life_exp_pollution$distance_gwinnett_km)

county_distance_gwinnett_rank <- county_life_exp_pollution[order(county_life_exp_pollution$distance_gwinnett_km, decreasing = TRUE), ]

#Selecting the closest counties within 200km from Gwinnett County
closest_counties_gwinnett <- tail(county_distance_gwinnett_rank, n =127)

#Plotting life expectancy (on the y-axis) against the counties' distance from the Gwinnett County, Georgia
ggplot(closest_counties_gwinnett, aes(x = distance_gwinnett_km, y = Life.Expectancy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  
  labs(x = "Distance from Gwinnett County, Georgia", y = "Life Expectancy", title = "Life Expectancy versus Distance from Gwinnett County")+
  geom_text(aes(label = county), hjust = 0.52, vjust = -1, 
            check_overlap = TRUE, 
            size=2.2)+
  theme_economist()

most_polluted_county <- county_pollution_rank[county_pollution_rank$Location == "Webb, Texas", ]

most_pol_county_ctr <- county_ctr[county_ctr$Location.x == most_polluted_county$Location, ]

county_life_exp_pollution$distance_webb = st_distance(county_ctr, most_pol_county_ctr)

county_life_exp_pollution$distance_webb_km <- set_units(county_life_exp_pollution$distance_webb, 'km')

county_life_exp_pollution$distance_webb_km <- as.numeric(county_life_exp_pollution$distance_webb_km)

county_distance_webb_rank <- county_life_exp_pollution[order(county_life_exp_pollution$distance_webb_km, decreasing = TRUE), ]

#Selecting the closest counties within 200km from Webb County
closest_counties_webb <- tail(county_distance_webb_rank, n =23)

#Plotting life expectancy (on the y-axis) against the counties' distance from the Webb County, Texas
ggplot(closest_counties_webb, aes(x = distance_webb_km, y = Life.Expectancy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  
  labs(x = "Distance from Webb County, Texas", y = "Life Expectancy", title = "Life Expectancy versus Distance from Webb County")+
  geom_text(aes(label = Location.x), hjust = 0.52, vjust = -1, 
            check_overlap = TRUE, 
            size=2.2)+
  theme_economist()

most_polluted_county <- county_pollution_rank[county_pollution_rank$Location == "Butler, Ohio", ]

most_pol_county_ctr <- county_ctr[county_ctr$Location.x == most_polluted_county$Location, ]

county_life_exp_pollution$distance_butler = st_distance(county_ctr, most_pol_county_ctr)

county_life_exp_pollution$distance_butler_km <- set_units(county_life_exp_pollution$distance_butler, 'km')

county_life_exp_pollution$distance_butler_km <- as.numeric(county_life_exp_pollution$distance_butler_km)

county_distance_butler_rank <- county_life_exp_pollution[order(county_life_exp_pollution$distance_butler_km, decreasing = TRUE), ]

#Selecting the closest counties within 200km from Butler County
closest_counties_butler <- tail(county_distance_butler_rank, n =134)

#Plotting life expectancy (on the y-axis) against the counties' distance from the Butler County, Ohio
ggplot(closest_counties_butler, aes(x = distance_butler_km, y = Life.Expectancy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  
  labs(x = "Distance from Butler County, Ohio", y = "Life Expectancy", title = "Life Expectancy versus Distance from Butler County")+
  geom_text(aes(label = Location.x), hjust = 0.52, vjust = -1, 
            check_overlap = TRUE, 
            size=2.2)+
  theme_economist()

most_polluted_county <- county_pollution_rank[county_pollution_rank$Location == "Winnebago, Illinois", ]

most_pol_county_ctr <- county_ctr[county_ctr$Location.x == most_polluted_county$Location, ]

most_pol_county_ctr <- county_ctr[county_ctr$Location.x == most_polluted_county$Location, ]

county_life_exp_pollution$distance_winnebago = st_distance(county_ctr, most_pol_county_ctr)

county_life_exp_pollution$distance_winnebago_km <- set_units(county_life_exp_pollution$distance_winnebago, 'km')

county_life_exp_pollution$distance_winnebago_km <- as.numeric(county_life_exp_pollution$distance_winnebago_km)

county_distance_winnebago_rank <- county_life_exp_pollution[order(county_life_exp_pollution$distance_winnebago_km, decreasing = TRUE), ]

#Selecting the closest counties within 200km from Winnebago County
closest_counties_winnebago <- tail(county_distance_winnebago_rank, n =69)

#Plotting life expectancy (on the y-axis) against the counties' distance from the Winnebago County, Illinois
ggplot(closest_counties_winnebago, aes(x = distance_winnebago_km, y = Life.Expectancy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  
  labs(x = "Distance from Winnebago County, Illinois", y = "Life Expectancy", title = "Life Expectancy versus Distance from Winnebago County")+
  geom_text(aes(label =county), hjust = 0.52, vjust = -1, 
            check_overlap = TRUE, 
            size=2.2)+
  theme_economist()

most_polluted_county <- county_pollution_rank[county_pollution_rank$Location == "Plumas, California", ]

most_pol_county_ctr <- county_ctr[county_ctr$Location.x == most_polluted_county$Location, ]

county_life_exp_pollution$distance_plumas = st_distance(county_ctr, most_pol_county_ctr)

county_life_exp_pollution$distance_plumas_km <- set_units(county_life_exp_pollution$distance_plumas, 'km')

county_life_exp_pollution$distance_plumas_km <- as.numeric(county_life_exp_pollution$distance_plumas_km)

county_distance_plumas_rank <- county_life_exp_pollution[order(county_life_exp_pollution$distance_plumas_km, decreasing = TRUE), ]

#Selecting the closest counties within 200km from Plumas County
closest_counties_plumas <- tail(county_distance_plumas_rank, n = 23)

#Plotting life expectancy (on the y-axis) against the counties' distance from the Plumas County, California
ggplot(closest_counties_plumas, aes(x = distance_plumas_km, y = Life.Expectancy)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  
  labs(x = "Distance from Plumas County, California", y = "Life Expectancy", title = "Life Expectancy versus Distance from Plumas County")+
  geom_text(aes(label = county), hjust = 0.52, vjust = -1, 
            check_overlap = TRUE, 
            size=2.2)+
  theme_economist()

#Selecting only the counties in California
california_data <- subset(county_pollution_life_exp, NAME_1 == "California")

#Plotting the average pollution levels (on the y-axis) of the counties in California in order to see the variation of the average pollution levels in the state
ggplot() +
  geom_point(data = california_data, aes(x = NAME_2, y = average_pollution), color = "blue") +
  labs(title = "Average Pollution in California by County",
       x = "County",
       y = "Average Pollution") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  coord_cartesian(ylim = c(3, 13))

#Selecting only the counties in Washington
washington_data <- subset(county_pollution_life_exp, NAME_1 == "Washington")

#Plotting the average pollution levels (on the y-axis) of the counties in Washington in order to see the variation of the average pollution levels in the state
ggplot() +
  geom_point(data = washington_data, aes(x = NAME_2, y = average_pollution), color = "blue") +
  labs(title = "Average Pollution in Washington by County",
       x = "County",
       y = "Average Pollution") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  coord_cartesian(ylim = c(3, 13))

#Selecting only the counties in Ohio
ohio_data <- subset(county_pollution_life_exp, NAME_1 == "Ohio")

#Plotting the average pollution levels (on the y-axis) of the counties in Ohio in order to see the variation of the average pollution levels in the state
ggplot() +
  geom_point(data = ohio_data, aes(x = NAME_2, y = average_pollution), color = "blue") +
  labs(title = "Average Pollution in Ohio by County",
       x = "County",
       y = "Average Pollution") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  coord_cartesian(ylim = c(3, 13))

#Selecting only the counties in Georgia
georgia_data <- subset(county_pollution_life_exp, NAME_1 == "Georgia")

#Plotting the average pollution levels (on the y-axis) of the counties in Georgia in order to see the variation of the average pollution levels in the state
ggplot() +
  geom_point(data = georgia_data, aes(x = NAME_2, y = average_pollution), color = "blue") +
  labs(title = "Average Pollution in Georgia by County",
       x = "County",
       y = "Average Pollution") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  coord_cartesian(ylim = c(3, 13))